Project Progress¶

link to notebook: https://github.com/navdeepnatt41/418-Group_Project/blob/main/ProgressReport.ipynb

Project introduction¶

With the increase in media attention on flight accidents and airplane issues this past year, we wanted to study whether the increase in media coverage is due to heightened scrutiny or a true rise in incidents. With over 45,000 flights being handled by the Federal Aviation Agency (FAA0 daily and over 2.9 million passengers, people expect to feel safe and secure when boarding their flights. Aviation is statistically one of the safest modes of transportation, but recent news has influenced general feelings regarding air travel. This leads us to ask: Is the rise in media attention due to a real increase in aviation incidents, or is it a result of heightened scrutiny and reporting?

To study flight accident and incident trends, we found data from the National Transportation and Security Board (NTSB). The NTSB provides data for each month with data information on each accident reported. The dataset includes a wide range of variables, such as: Number and type of injuries Probable cause of the incident Geographic coordinates (latitude and longitude) Aircraft make and model Operator Weather conditions at the time of the event Extent of aircraft damage Phase of flight (takeoff, cruise, landing, etc.) By analyzing trends across these variables, we aim to determine whether the rate of flight incidents has actually increased or if the perception of danger is being amplified by media coverage. Our goal is to provide data-driven insight into the current state of aviation safety and help distinguish between perception and reality.

Changes since proposal¶

Given that the dataset only included incident reports, we quickly recognized the limitations in assessing the prevalence of social media-related incidents or comparing flights that had incidents compared to flights that did not result in an incident. As a result, we narrowed the scope of our analysis. Instead, we focused on identifying potential factors that may increase the likelihood of an incident occurring.

Data¶

We attempted to download a JSON/CSV file with the data over a 10-year period, however after doing so we quickly realized that a lot of information relevant to our project was missing due to how NTSB had compiled the data over the 10-year period. We decided to download the data over each year and then with Pandas, we were able to merge each year together into one dataframe to get 10 years worth of data . By doing so, we ensured that all the data that we were seeing in the a month CSV file would also be in the 10-year CSV without any missing columns.

In [13]:
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt

import warnings
warnings.simplefilter(action='ignore', category=UserWarning)
In [ ]:
# take in the downloaded csv data over a 10-year period and merge it into one database
# export the database to a CSV for the rest of the group to download 
db15 = pd.read_csv('2015.csv')
db16 = pd.read_csv('2016.csv')
db17 = pd.read_csv('2017.csv')
db18 = pd.read_csv('2018.csv')
db19 = pd.read_csv('2019.csv')
db20 = pd.read_csv('2020.csv')
db21 = pd.read_csv('2021.csv')
db22 = pd.read_csv('2022.csv')
db23 = pd.read_csv('2023.csv')
db24 = pd.read_csv('2024.csv')
db25 = pd.read_csv('2025.csv')
# concatenating the data over each year into one dataframe 
frames = [db15, db16, db17, db18, db19, db20, db21, db22, db23, db24, db25]
mergedResults = pd.concat(frames)
# merging the results into a CSV 
mergedResults.to_csv('Merged10yrdata.csv', index=False)

Explatory Data Analysis¶

For our Explatory Data Analysis, we all came up with 3 different visualization ideas then came together to choose 1 to work on for the time being. The following were chosen and are included in this report:

Angela: Create a visualization that compares Makes of planes and the broad phase of flight (approach, initial climb, landing) [STACKED BAR CHART]

Nav: Plot accident data over a set of years with respect to location. We could plot accident points on an Earth map and also study which regions comparatively got the most accidents over time [SCATTERPLOT]

Oli: Time between accident date and report publication with a heatmap or bar chart visualization to identify if reports are being published more quickly in recent years, potentially indicating that there is increased scrutiny. [HEATMAP]

Lora: Line graph of number of flight incidents over time. Use separate lines for fatalities, non fatal injuries, and non-injuries [LINE GRAPH]

Yesui: Flight operator (company) vs. number of accidents (look at number of accidents per operator over total number) [BAR CHART]

Angela's Visualization¶

Angela began by creating a visualization to examine which phases of flight—such as takeoff, en route, and landing—were most frequently associated with incidents. The initial analysis focused on five aircraft manufacturers and revealed a notably higher incident rate during the landing phase for Cessna and Piper aircraft. Upon further investigation, it became clear that these two companies primarily produce smaller, general aviation aircraft used for personal, private, and business purposes. Since the original hypothesis was centered on commercial aviation, Angela refined her analysis by narrowing the focus to Boeing and Airbus, the two leading commercial aircraft manufacturers.

The updated visualization revealed that for Boeing and Airbus aircraft, incidents were more likely to occur during landing and en route. This pattern raised questions about contributing factors, such as prior damage to the aircraft, inconsistent maintenance routines, weather conditions, bird strikes, or even human error. These findings suggest that while the phase of flight is a key variable, a deeper exploration into other factors is needed to fully understand the causes of these incidents.

In [ ]:
#Create a visualization that compares Makes of planes  and the broad phase of flight (approach, initial climb, landing) [STACKED BAR CHART]

# cleaning up the data 
sortedDB = pd.read_csv('Merged10yrdata.csv', low_memory=False)
sortedDB['Make'] = sortedDB['Make'].str.lower()
sortedDB['BroadPhaseofFlight'] = sortedDB['BroadPhaseofFlight'].str.lower()
# sort out incidents that only occurred in the US and are categorized as AIR
filteredOutCountry = sortedDB[(sortedDB['Country'] == 'United States') & (sortedDB['AirCraftCategory'] == 'AIR')]
# group by make and phase of flight 
results = filteredOutCountry[['EventDate','Make', 'BroadPhaseofFlight']].groupby(['Make', 'BroadPhaseofFlight']).count().reset_index()
# rename columns
results = results.rename(columns={'EventDate': 'Count'})
# results.to_csv('results.csv', index=False) 

sumByPhase = results.groupby('BroadPhaseofFlight')['Count'].sum().reset_index()

sumByPhase = sumByPhase.sort_values(by='Count', ascending=False)
# print(sumByPhase)

plot = sns.barplot(x='BroadPhaseofFlight', y='Count', data= sumByPhase)
plt.xticks(rotation=90)
#plt.show()
Out[ ]:
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
 [Text(0, 0, 'landing'),
  Text(1, 0, 'enroute'),
  Text(2, 0, 'takeoff'),
  Text(3, 0, 'approach'),
  Text(4, 0, 'initial climb'),
  Text(5, 0, 'maneuvering'),
  Text(6, 0, 'taxi'),
  Text(7, 0, 'standing'),
  Text(8, 0, 'unknown'),
  Text(9, 0, 'emergency descent'),
  Text(10, 0, 'uncontrolled descent'),
  Text(11, 0, 'pushback/tow'),
  Text(12, 0, 'after landing'),
  Text(13, 0, 'post-impact')])
In [ ]:
# getting the total count of flights by make 
sumByMake = results.groupby(['Make','BroadPhaseofFlight'])['Count'].sum().reset_index()

sumByMake = sumByMake.sort_values(by='Count', ascending=False)
# calculating the sum of each make 
totalSum = sumByMake.groupby('Make')['Count'].sum().reset_index()
# sorting and column renaming 
totalSum = totalSum.sort_values(by='Count', ascending=False)
totalSum = totalSum.rename(columns={'Count': 'Total'})
sumByMake = sumByMake.merge(totalSum, on='Make')
# calculating the percentage of each phase 
sumByMake['Percent'] = (sumByMake['Count']/ sumByMake['Total']) *100
#print(sumByMake)
# to visually display the data correctly getting the top 5 makes that are represented in the data 
top_makes = sumByMake.groupby('Make')['Count'].sum().nlargest(5).index
top5 = sumByMake[sumByMake['Make'].isin(top_makes)]

#print(top5)
# creating the bar plot and setting the axis' labels and title
plot = sns.barplot(x='Make', y='Percent', hue='BroadPhaseofFlight', data=top5, palette='Set2')
plot.set_xlabel("Top 5 Airplane Makes Involved")
plot.set_ylabel("Total incidents per phase(%)")
plot.set_title("Figure 1: Phase of Flight Incident Analysis based upon Airplane Make")
plt.xticks(rotation=90)
plt.legend(title="Phase of Flight", loc= 'upper right', bbox_to_anchor=(1.8, 1))
plt.show()
No description has been provided for this image
In [ ]:
# since our term project only focuses on commercial airlines, I narrowed down data reflecting commercial flights which are Boeing and Airbus
commercial = results[results['Make'].str.contains('Boeing|Airbus', case=False, na=False)]
# including the data that may contain hypenated airline makes 
commercial['Make'] = commercial['Make'].str.split().str[0]
commercial['Make'] = commercial['Make'].str.split('-').str[0]
# print(commercial)
# getting a sum which is grouped by the make 
sumByMake = commercial.groupby(['Make','BroadPhaseofFlight'])['Count'].sum().reset_index()

# sumByMake = sumByMake.sort_values(by='Count', ascending=False)
# getting a total sum to calculate the percentage 
totalSum = sumByMake.groupby('Make')['Count'].sum().reset_index()
# renaming columns 
totalSum = totalSum.rename(columns={'Count': 'Total'})
sumByMake = sumByMake.merge(totalSum, on='Make')
sumByMake['Percent'] = (sumByMake['Count']/ sumByMake['Total']) *100
#print(sumByMake)
# creating the bar plot and renamed the axis' 
plot = sns.barplot(x='BroadPhaseofFlight', y='Percent', hue='Make', data=sumByMake, palette='viridis')
plot.set_xlabel("Phase of Flight")
plot.set_ylabel("Total incidents per phase(%)")
plot.set_title("Figure 2: Phases of Flight Accidents based upon Airline Manufacturer")
plt.xticks(rotation=90)

# # 9) Customize legend if necessary
plt.legend(title="Commercial Airline Manufacturer", loc= 'upper right',  bbox_to_anchor=(1.5, 1))
plt.show()
No description has been provided for this image

Yesui's Visualization¶

Yesui’s goal was to determine which operators had the most accidents. Given our data, first she cleaned it up by removing any lines with missing data then by cleaning up the operators. She then standardized the operator names, addressing inconsistencies such as "Delta Airlines" appearing in multiple forms (e.g., "Delta Air Lines," "Delta Airlines, Inc," etc.). By consolidating these variations into a single, consistent name, she ensured more accurate and meaningful analysis. Next, she filtered the dataset to include only the major U.S. airlines: United Airlines, American Airlines, Southwest Airlines, Delta Airlines, JetBlue Airways, Spirit Airlines, and Frontier Airlines. Based on this refined data, United Airlines appeared to have the highest number of flight incidents as a percentage of the group. However, this percentage alone doesn’t tell the full story. Understanding the total number of flights operated by each airline is crucial—an airline with more flights will naturally have more opportunities for incidents. For example, if United operates far more flights per day than Spirit, its higher incident count may reflect its volume rather than a higher rate of incidents per flight. Incorporating flight volume data would allow for a more accurate comparison of incident rates.

In [ ]:
# Load CSV file
df = pd.read_csv('Merged10yrdata.csv', low_memory=False)

# Fill missing operator values with 'N/A' and remove them
df = df.copy()  # Ensure we're modifying the original DataFrame
df['Operator'] = df['Operator'].fillna('N/A')  # Assign explicitly instead of using inplace=True
df = df[~df['Operator'].isin(['N/A', 'Not reported', 'Unknown', 'Private', 'Private Individual', 'Pilot', 'Not Provided by Authority'])]
# Count accidents per operator
df['Operator'] = df['Operator'].str.strip()  # Removes leading/trailing whitespace
df['Operator'] = df['Operator'].str.replace('\t', '', regex=False)  # Removes tabs
df['Operator'] = df['Operator'].str.lower()  # Convert to lowercase for consistency

df['Operator'] = df['Operator'].replace({
    'united airlines': 'united airlines',
    'united airlines inc': 'united airlines',
    'united air lines inc': 'united airlines', 
    'southwest airlines co': 'southwest airlines',
    'delta air lines': 'delta airlines',
    'delta air lines inc': 'delta airlines',
    'delta air lines, inc': 'delta airlines',
    'delta airlines inc': 'delta airlines',
    'delta airlines, inc.': 'delta airlines',
    'delta airlines inc': 'delta airlines',
    'american airlines': 'american airlines',
    'american airlines inc': 'american airlines',
    'ryanair': 'ryanair',
    'ryan air': 'ryanair',
    'civil air patrol': 'civil air patrol',
    'civil air patrol inc': 'civil air patrol',
    'air methods corp': 'air methods',
    'air methods corporation': 'air methods',
    'indigo': 'indigo',
    'indigo air': 'indigo', 
    'jetblue airways': 'jetblue airways',
    'jetblue airways corp': 'jetblue airways',
    'jetblue corp': 'jetblue airways',
    'spirit airlines inc': 'spirit airlines',
    'spirit airlines': 'spirit airlines'
}, regex=True)

df['Operator'] = df['Operator'].str.title()

accident_counts = df['Operator'].value_counts()
accident_counts = accident_counts[accident_counts > 4] 

# Define the operators you want to focus on
selected_operators = ['Delta Airlines', 'American Airlines', 'Southwest Airlines', 'Frontier Airlines', 'Spirit Airlines', 'United Airlines', 'Jetblue Airways']
# Filter the data to include only the selected operators
filtered_data = df[df['Operator'].isin(selected_operators)]

# Count accidents per operator for the filtered data
select_accident_counts = filtered_data['Operator'].value_counts()

# Calculate the total number of accidents for the selected operators
total_accidents = select_accident_counts.sum()

# Calculate the percentage for each operator
select_accident_percentages = (select_accident_counts / total_accidents) * 100

# Create the bar chart
plt.figure(figsize=(10, 6))  # Adjust the size of the figure
plt.bar(select_accident_percentages.index, select_accident_percentages.values, color='skyblue')
#plt.bar(accident_counts.index, accident_counts.values, color='skyblue')
plt.xlabel('Flight Operator')
plt.ylabel('Percent of Accidents (%)')
plt.title('Figure 3: Percent of Accidents for Selected Airlines')
plt.xticks(rotation=90)  # Rotate the x-axis labels for better visibility
plt.show()
No description has been provided for this image

Lora's Visualization¶

My goal was to visualize the number of incidents for US commercial airlines over the past decade, and see whether there is an increase in incidents. In Figure 4.1, it does seem like there is an uptick in incidents that resulted in no injuries for the past couple of years. 2020 observed the lowest number of incidents, which is likely the result of the COVID-19 pandemic. 2025 shows a low number of incidents, but that is because 2025 hasn’t finished yet, so we simply don’t have the data for the rest of the year. Because of this, in Figure 4.2, I plotted the cumulative number of incidents over the first 150 days of the year to see how 2025 is doing compared to the start of previous years. For both 2024 and 2025, there does seem to be a higher than average incident count for the first 50 days of the year. Based off of these visualizations, it’s possible that there actually is an uptick in the number of incidents in recent years, but since the difference is only by 1-2 incidents, it isn’t very conclusive.

In [ ]:
df = pd.read_csv('Merged10yrdata.csv', low_memory=False)
df = df[(df['Country'] == 'United States')]
commercial_makes = [
    'BOEING', 'Airbus', 'Boeing', 'AIRBUS', 
    'AIRBUS INDUSTRIE', 'BOEING, AIRBUS', 'AIRBUS SAS'
]

df['isCommercial'] = df['Make'].isin(commercial_makes)
#df

df = df[['EventDate','FatalInjuryCount', 'SeriousInjuryCount',
       'MinorInjuryCount', 'HighestInjuryLevel', 'isCommercial']]
df = df[df['isCommercial'] == True]

df['HighestInjuryLevel'] = df['HighestInjuryLevel'].fillna("None")
#df

df_injuries = df[['EventDate','FatalInjuryCount', 'SeriousInjuryCount',
       'MinorInjuryCount']]

df_injuries['EventDate'] = pd.to_datetime(df_injuries['EventDate'])
df_injuries.set_index('EventDate', inplace=True)
#df_injuries
 
df_highestInc = df
df_highestInc['EventDate'] = pd.to_datetime(df_highestInc['EventDate']) # convert event date to datetime format
df_highestInc['year'] = df_highestInc['EventDate'].dt.year # extract the year

# get cartesian groped of yearsXhighstinjurylevel so that we can fill in rows with 0 incidents
all_years = df_highestInc['year'].unique()
all_levels = df_highestInc['HighestInjuryLevel'].unique()
full_index = pd.MultiIndex.from_product([all_years, all_levels], names=['year', 'HighestInjuryLevel'])

# group by year and count number of incidents per injury level
df_highestInc_grouped = df_highestInc.groupby(['year','HighestInjuryLevel']).size().reset_index(name='incidentCount').set_index(['year', 'HighestInjuryLevel']) # group by injury level and count number of incidents
df_highestInc_grouped_full = df_highestInc_grouped.reindex(full_index, fill_value=0).reset_index()


colors = {
    'Fatal': 'dimgrey',
    'Serious': 'indianred',
    'Minor': 'orange',
    'None': 'royalblue',
}


plt.figure(figsize=(12, 6))
sns.lineplot(data=df_highestInc_grouped_full, x='year', y='incidentCount', hue='HighestInjuryLevel', palette=colors, linewidth=2,  hue_order=['Fatal', 'Serious', 'Minor', 'None'])

plt.xlabel("Year")
plt.ylabel("Number of Incidents")
plt.title("Figure 4.1: Incidents Over Time by Highest Injury Level")
plt.xticks(rotation=45) 
plt.xticks(range(2015,2026))
# plt.xlim(2015,2025)
plt.legend(title="Highest Injury Level", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.grid(True)
plt.show()
No description has been provided for this image
In [ ]:
df_cumInc = df

df_cumInc['EventDate'] = pd.to_datetime(df['EventDate'])
df_cumInc['year'] = df['EventDate'].dt.year
df_cumInc['dayOfYear'] = df['EventDate'].dt.dayofyear

df_grouped = df_cumInc.groupby(['year','dayOfYear']).size().reset_index(name='incidentCount')
# df_grouped['cumulativeIncidents'] = df_grouped.groupby('year')['incidentCount'].cumsum()

# create an index for every (year, dayOfYear) combination
full_index = []
for year in df_grouped['year'].unique():
    max_day = 366 if pd.Timestamp(f"{year}-12-31").is_leap_year else 365
    full_index.extend([(year, day) for day in range(1, max_day + 1)])

# fill in missing days
df_full = pd.DataFrame(full_index, columns=['year', 'dayOfYear'])
df_merged = pd.merge(df_full, df_grouped, on=['year', 'dayOfYear'], how='left')
df_merged['incidentCount'] = df_merged['incidentCount'].fillna(0)

# compute cumulative incidents per year
df_merged['cumulativeIncidents'] = df_merged.groupby('year')['incidentCount'].cumsum()

df_merged
plt.figure(figsize=(12, 6))

for year in range(2015, 2026): 
    # yearly_data = df_merged[df_merged['year'] == year]

    
    yearly_data = df_merged[(df_merged['year'] == year) & (df_merged['dayOfYear'] <= 150)]
    yearly_data['smoothed'] = yearly_data['cumulativeIncidents'].rolling(window=100, min_periods=1).mean()
 
    if year == 2025:
        yearly_data = yearly_data[yearly_data['dayOfYear'] <= 55]
        plt.plot(yearly_data['dayOfYear'], yearly_data['smoothed'], label=str(year), color='red', linewidth=2.5)
    if year == 2024:
        # yearly_data = yearly_data[yearly_data['dayOfYear'] <= 55]
        plt.plot(yearly_data['dayOfYear'], yearly_data['smoothed'], label=str(year), color='lightcoral', linewidth=2.5)
    else:
        plt.plot(yearly_data['dayOfYear'], yearly_data['smoothed'], color='gray', linewidth=1, alpha=0.5)

plt.xlabel("Day of Year")
plt.ylabel("Cumulative Number of Incidents")
plt.yticks(range(0,20,2))
plt.xticks(range(0,151,10))
plt.xlim(0,150)
plt.title("Figure 4.2: Cumulative Airplane Incidents Over the Year")
plt.legend(title="Year", bbox_to_anchor=(1.05, 1), loc="upper left") 
# plt.grid(True)
plt.show()
No description has been provided for this image

Oli's Visualization¶

Oli wanted to explore the hypothesis of “Reports of US aviation incidents are being published more quickly in recent years, indicating increased scrutiny.” Oli first calculated and visualized the time delay (in years) between when a US aviation incident occurred and when its official report was published. Figure 5.1 shows this, displaying how long it typically takes for reports to be published and tells us whether a lot of the reports are delayed significantly. From this figure, we see that half of the total reports filed are published within a year, exponentially decreasing with each passing year. Next, Oli grouped each report delay into bins of 120 days and grouping by event year to track how report delay distributions have changed over time. Figure 5.2 shows this in a heatmap, mostly revealing that long reporting times were steadily increasing prior to 2022 but due to lack of recent years’ reports (mainly because of the recency of the events which resulted in the large amount of unfiled data), it is hard to definitely confirm a trend of acceleration yet. Also, the jump increase in 2019’s late report filings probably had to do with the covid pandemic where many reports could have been delayed as a result but afterwards there was a speed up in report publishings, indicating an improvement. The hypothesis is therefore partially supported where there is some evidence of improvement since 2019 but the vast percentage of recent unfiled reports makes it hard to draw strong conclusions. Using a machine learning model on existing data prior to 2022 from figure 5.1 and 5.2 can help draw a better conclusion on this hypothesis (next section).

In [ ]:
plane_data = pd.read_csv('Merged10yrdata.csv', low_memory=False)
event_dates = plane_data['EventDate']
publish_dates = plane_data['OriginalPublishedDate'] # OriginalPublishDate

plane_data['EventDate'] = pd.to_datetime(plane_data['EventDate'])
plane_data['OriginalPublishedDate'] = pd.to_datetime(plane_data['OriginalPublishedDate']) # OriginalPublishDate

# only want US data
plane_data = plane_data[plane_data['Country'] == 'United States']

# calculate report delays in years
plane_data['ReportDelay'] = (plane_data['OriginalPublishedDate'] - plane_data['EventDate']).dt.days / 365
num_unpublished = plane_data['ReportDelay'].isna().sum()  # Count missing reports

# drop NaN values
report_delays = plane_data['ReportDelay'].dropna()

# plot
plt.figure(figsize=(10, 6))
bins = np.arange(0, report_delays.max() + 1, 1)  # Bin width of 1 year
plt.hist(report_delays, bins=bins, edgecolor='black', alpha=0.7)

plt.xlabel("Years Until Report Published")
plt.ylabel("Number of Reports")
plt.title("Figure 5.1: Distribution of US Report Delays (in Years)")

# Add caption for unpublished reports
plt.figtext(0.50, 0.60, f"Number of reports never filed: {num_unpublished}", fontsize=12, color='red')

plt.show()
No description has been provided for this image
In [ ]:
# calculate the delay between event and report
plane_data['ReportDelay'] = (plane_data['OriginalPublishedDate'] - plane_data['EventDate']).dt.days # OriginalPublishDate

# create bins for delays
bins = [-np.inf, 120, 240, 360, 480, 600, 720, 840, np.inf]
labels = ['1-120', '121-240', '241-360', '361-480', '481-600', '601-720', '721-840', '840+']
plane_data['DelayCategory'] = pd.cut(plane_data['ReportDelay'], bins=bins, labels=labels)

# assign 'Not Filed' for missing values
plane_data['DelayCategory'] = plane_data['DelayCategory'].cat.add_categories(['Not Filed']).fillna('Not Filed')

# group by event year and delay category
heatmap_data = plane_data.groupby([plane_data['EventDate'].dt.year, 'DelayCategory']).size().unstack()

bins = [-np.inf, 120, 240, 360, 480, 600, 720, 840, np.inf]
labels = ['1-120', '121-240', '241-360', '361-480', '481-600', '601-720', '721-840', '840+']
plane_data['DelayCategory'] = pd.cut(plane_data['ReportDelay'], bins=bins, labels=labels)

# assign 'Not Filed' for missing values
plane_data['DelayCategory'] = plane_data['DelayCategory'].cat.add_categories(['Not Filed']).fillna('Not Filed')

# group by event year and delay category
heatmap_data = plane_data.groupby([plane_data['EventDate'].dt.year, 'DelayCategory']).size().unstack()

# calculate by percentage
heatmap_data_percentage = heatmap_data.div(heatmap_data.sum(axis=1), axis=0) * 100

heatmap_data_percentage

# resize
plt.figure(figsize=(10, 6))

# create heatmap
sns.heatmap(heatmap_data_percentage, annot=True, fmt=".1f", cmap="Blues", linewidths=0.5)

# labels and title
plt.xlabel("Report Delay (in days)")
plt.ylabel("Event Year")
plt.title("Figure 5.2: Percentage of US Reports by Delay Category per Year")

# plot
plt.show()
No description has been provided for this image

Nav's Visualization¶

My Visualization aimed at getting a broad sense of where accidents are occuring and sensing whether we can attribute accidents to location. When plotting, however, we find that the USA and Europe have the most accidents. A basic analysis might suggest that accidents occur most over these regions due to some intrincsic factor to the USA; this might be the case due to how much traffic the US gets. The same is true for Europe; these regions are the main Economic, Tourist, and otherwise main hubs of the world.

In [2]:
from IPython.display import display
from PIL import Image

path="nav.png"
display(Image.open(path))
No description has been provided for this image

Machine Learning Analysis¶

Lora's ML¶

I created a decision tree with the goal of trying to predict the level of injury for a given incident based on certain features from the flight data: phase of flight, weather at time of incident, month, and make of aircraft (Boeing or Airbus only). Our goal was to see if any of these variables are good predictors of the severity of an incident, if one were to occur, which could potentially help airline staff and passengers be better prepared during emergency situations. My initial assumption was that certain flight phases (taxing vs in route) and weather conditions (clear weather vs heavy thunderstorm) would result in more severe incidents and delayed emergency responses, which could make them good predictors for the severity of injuries.

For the weather data, I used the python package Meteo to obtain weather data (ranging from ‘clear’ to ‘storming’) for each incident, since it wasn’t included in the dataset.

For the evaluation of this model, I used accuracy, precision recall, and the f1 score. I tried adjusting various parameters such as the depth of the tree, the criterion for the decision tree classifier, but I was unable to achieve a high score for any of the scores. The average accuracy score hovered around 0.60, and the average f1 score was around 0.40. Because of class imbalance, with only 8% of incidents reported either minor or fatal injuries, I also tried a logistic regression on this training set and added weights to balance out the categories, but I achieved similar results.

In conclusion, these features do not seem to be good predictors of the injury level for an incident. I plan to take a closer look at the individual features to check their correlation between them and the injury level, as well as play around with other combinations of features, such as how far the incident occurred from civilization, and breaking the weather data down into amount of rainfall, wind speed, and temperature.

In [ ]:
df = pd.read_csv('Merged10yrdata.csv', low_memory=False)

# categorize incidents by make (Airbus and Boeing only)
commercial_makes = [
    'BOEING', 'Airbus', 'Boeing', 'AIRBUS', 
    'AIRBUS INDUSTRIE', 'BOEING, AIRBUS', 'AIRBUS SAS'
]
df['isCommercial'] = df['Make'].isin(commercial_makes)

# convert date to month
df['EventDate'] = pd.to_datetime(df['EventDate'])
df['Month'] = df['EventDate'].dt.month

# convert weather condition to something more readable


# filter out unnecessary columns
df = df[(df['Country'] == 'United States') & df['isCommercial'] == True]
df = df[['EventDate', 'Latitude', 'Longitude ', 'City', 'State', 'WeatherCondition', 'BroadPhaseofFlight', 'Make', 'Month', 'HighestInjuryLevel']]


# fill in blank injury level columns with None
df['HighestInjuryLevel'] = df['HighestInjuryLevel'].fillna("None")
In [ ]:
!pip install meteostat
Requirement already satisfied: meteostat in c:\users\yesui\anaconda3\lib\site-packages (1.6.8)
Requirement already satisfied: pandas>=1.1 in c:\users\yesui\anaconda3\lib\site-packages (from meteostat) (2.2.2)
Requirement already satisfied: pytz in c:\users\yesui\anaconda3\lib\site-packages (from meteostat) (2024.1)
Requirement already satisfied: numpy in c:\users\yesui\anaconda3\lib\site-packages (from meteostat) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\yesui\anaconda3\lib\site-packages (from pandas>=1.1->meteostat) (2.9.0.post0)
Requirement already satisfied: tzdata>=2022.7 in c:\users\yesui\anaconda3\lib\site-packages (from pandas>=1.1->meteostat) (2023.3)
Requirement already satisfied: six>=1.5 in c:\users\yesui\anaconda3\lib\site-packages (from python-dateutil>=2.8.2->pandas>=1.1->meteostat) (1.16.0)
In [ ]:
from datetime import datetime
from meteostat import Point, Hourly, Stations
In [ ]:
import warnings
warnings.simplefilter('ignore')

def fetch_weather(row):
    ts = row['EventDate'].tz_localize(None)
    location = Point(row['Latitude'], row['Longitude '])
    location.radius = 100000
    weather_data = Hourly(location, ts, ts + pd.Timedelta(hours=1)).fetch()

    if weather_data.empty:
        return pd.Series([None, None, None])
    else:
        return pd.Series([weather_data['prcp'].iloc[0], weather_data['wspd'].iloc[0], weather_data['coco'].iloc[0]])
In [ ]:
df_weather = df
df_weather[['precip', 'windSpeed', 'weatherCode']] = df_weather.apply(fetch_weather, axis=1)
In [ ]:
df_weather = df_weather.dropna(subset=['weatherCode'])
In [ ]:
print(df_weather['BroadPhaseofFlight'].unique())
print(df_weather['Make'].unique())
print(df_weather['HighestInjuryLevel'].unique())
['Enroute' 'Approach' 'Initial Climb' 'Landing' 'Standing' 'Pushback/Tow'
 'Takeoff' 'Taxi' 'Maneuvering']
['BOEING' 'AIRBUS' 'AIRBUS INDUSTRIE' 'Airbus' 'Boeing' 'BOEING, AIRBUS'
 'AIRBUS SAS']
['Serious' 'Fatal' 'None' 'Minor']
In [ ]:
df_weather = df_weather[df_weather['Make'] != 'BOEING, AIRBUS']
In [ ]:
# map columns to numerical values because they can't be categorial
phase_mapping = {
    'Enroute': 0,
    'Approach': 1,
    'Initial Climb': 2,
    'Landing': 3,
    'Standing': 4,
    'Pushback/Tow': 5,
    'Takeoff': 6,
    'Taxi': 7,
    'Maneuvering': 8
}

make_mapping = {
    'BOEING': 0,
    'Boeing': 0,
    'AIRBUS': 1,
    'AIRBUS INDUSTRIE': 1,
    'Airbus': 1,
    'AIRBUS SAS': 1
}

injury_mapping = {
    'Fatal': 0,
    'Serious': 1,
    'Minor': 2,
    'None': 3,
}

weather_code_map = {
    1: "Clear",
    2: "Fair",
    3: "Cloudy",
    4: "Overcast",
    5: "Fog",
    6: "Freezing Fog",
    7: "Light Rain",
    8: "Rain",
    9: "Heavy Rain",
    10: "Freezing Rain",
    11: "Heavy Freezing Rain",
    12: "Sleet",
    13: "Heavy Sleet",
    14: "Light Snowfall",
    15: "Snowfall",
    16: "Heavy Snowfall",
    17: "Rain Shower",
    18: "Heavy Rain Shower",
    19: "Sleet Shower",
    20: "Heavy Sleet Shower",
    21: "Snow Shower",
    22: "Heavy Snow Shower",
    23: "Lightning",
    24: "Hail",
    25: "Thunderstorm",
    26: "Heavy Thunderstorm",
    27: "Storm"
}

df_weather['flightPhase'] = df['BroadPhaseofFlight'].map(phase_mapping)
df_weather['make'] = df['Make'].map(make_mapping)
df_weather['injuryLevel'] = df['HighestInjuryLevel'].map(injury_mapping)
df_weather['weatherDescription'] = df_weather['weatherCode'].map(weather_code_map)
In [ ]:
# check if classes are imbalanced
df_weather['HighestInjuryLevel'].value_counts()
Out[ ]:
HighestInjuryLevel
None       88
Serious    74
Fatal       7
Minor       7
Name: count, dtype: int64

- Decision Tree¶

Goal: predict the highest injury level if an incident were to occur based on certain flight conditions: phase of flight, weather, time of year, and make of aircraft (Boeing or Airbus)

The purpose of this model is to identify conditions that are more likely to result in serious or fatal injuries.

In [ ]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
In [ ]:
features = ['Month', 'weatherCode', 'flightPhase', 'make']
# features = ['BroadPhaseofFlight', 'Make', 'Month', 'weatherDescription']

X = df_weather[features]
# X = pd.get_dummies(df_weather[features])
y = df_weather['injuryLevel']

X_train, X_test, y_train, y_test = train_test_split(X, y)

clf = DecisionTreeClassifier(criterion="entropy", max_depth=5)
clf.fit(X_train, y_train)

plt.figure(figsize=(50, 30), dpi=300)
tree.plot_tree(clf, feature_names=features, filled=True, precision=0, class_names=['Fatal','Serious','Minor','None'], fontsize=17)
# tree.plot_tree(clf, filled=True, precision=0, fontsize=10)
plt.savefig('decision_tree.png')

plt.show()
No description has been provided for this image
In [ ]:
y_pred = clf.predict(X_test)

# evaluating our model
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred, average='macro'))
print("Recall:",metrics.recall_score(y_test, y_pred, average='macro'))
print("f1_score:",metrics.f1_score(y_test, y_pred, average='macro'))
Accuracy: 0.6590909090909091
Precision: 0.3477272727272728
Recall: 0.3472222222222222
f1_score: 0.34431818181818186
In [ ]:
conf_matrix = metrics.confusion_matrix(y_test, y_pred)

plt.figure(figsize=(4,4))
sns.set(font_scale = 1.5)
 
ax = sns.heatmap(
    conf_matrix,
    annot=True,
    fmt='d',
    cbar=False,
    cmap='crest',
    xticklabels=['Fatal','Serious','Minor','None'],
    yticklabels=['Fatal','Serious','Minor','None']
)
 
ax.set_xlabel("Predicted", labelpad=20)
ax.set_ylabel("Actual", labelpad=20)
plt.show()
No description has been provided for this image

Some information on evalutation metrics:

" For datasets with imbalanced classes, a model could achieve high Accuracy by predicting the majority class most of the time

F1 Score is defined as the harmonic mean of Precision and Recall. If any of them becomes extremely low, F1 Score will also go down. Thus, F1 Score can help you find a good balance between Precision and Recall. "

In [ ]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(
    solver='lbfgs',
    class_weight='balanced' # handle imbalanced classes
)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred, average='macro'))
print("Recall:",metrics.recall_score(y_test, y_pred, average='macro'))
print("f1_score:",metrics.f1_score(y_test, y_pred, average='macro'))
Accuracy: 0.22727272727272727
Precision: 0.275
Recall: 0.3541666666666667
f1_score: 0.21521942110177406

Oli's ML¶

Oli wanted to continue the exploration of the hypothesis “Reports of US aviation incidents are being published more quickly in recent years, indicating increased scrutiny” by training a model using the delay data from 2015 to 2022 (2023 onwards has too many reports unfiled which would mess with the model). He first wanted to observe how accurately a model can predict against real reports so he trained a Hist Gradient Boosting classifier. To handle unfiled reports, Oli dropped all of them to focus just on classification. Figure 7.1 shows the confusion matrix along with its statistics of each delay category (same ones from figure 5.2). The model struggles with longer delay ranges (721-840 and 840+) where precision and recall values are zero but it performs better in the 121-240 delay category, with a recall of 0.70. However, the overall accuracy of the model is relatively low at 23%, and the macro and weighted averages further highlight the model's weak performance across all classes. The confusion matrix and classification report suggest that the model is having difficulty generalizing to all delay categories, with the majority of the predictions concentrated in a few categories and a significant number of misclassifications especially in the 1-120 range. Despite this poor classification, Oli decided to train a model to predict future report delays. To handle the low minority of unfiled reports in the 7 year time frame, Oli used hot deck imputation for this model. To model the delay, he extracted simple features (such as EventMonth and EventDayOfWeek) and trained a Hist Gradient Boosting Regressor on this data. The model learns patterns in how publishing delays vary across time, and its performance is evaluated using standard regression metrics such as Mean Absolute Error, Mean Squared Error and R^2 score. Once trained, the model is used to simulate future scenarios by generating synthetic aviation events for each month from 2025 to 2035. These synthetic events are passed through the model to predict publishing delays, which are then averaged by year. Based on figure 7.2, the model showed poor predictive power with an r score of just 0.0022 and predicted consistently high average delays, with no meaningful downward trend over time. These results do not support the original hypothesis which means the large increase in media scrutiny in recent years has not been a result of the publication of reports.

MAE: 259.61 days

RMSE: 326.49 days

R^2: 0.0022

In [ ]:
import pandas as pd
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
In [ ]:
# load and preprocess data (already done from oli_visualizations)
plane_data = pd.read_csv('Merged10yrdata.csv', low_memory=False)
plane_data['EventDate'] = pd.to_datetime(plane_data['EventDate'])
plane_data['OriginalPublishedDate'] = pd.to_datetime(plane_data['OriginalPublishedDate']) # OriginalPublishDate
plane_data = plane_data[plane_data['Country'] == 'United States']
plane_data['ReportDelay'] = (plane_data['OriginalPublishedDate'] - plane_data['EventDate']).dt.days

# bin the delay
bins = [-np.inf, 120, 240, 360, 480, 600, 720, 840, np.inf]
labels = ['1-120', '121-240', '241-360', '361-480', '481-600', '601-720', '721-840', '840+']
plane_data['DelayCategory'] = pd.cut(plane_data['ReportDelay'], bins=bins, labels=labels)
plane_data['DelayCategory'] = plane_data['DelayCategory'].cat.add_categories(['Not Filed']).fillna('Not Filed')

# extraction
plane_data['DelayDays'] = (plane_data['OriginalPublishedDate'] - plane_data['EventDate']).dt.days
plane_data['EventYear'] = plane_data['EventDate'].dt.year
plane_data['EventMonth'] = plane_data['EventDate'].dt.month
plane_data['EventDayOfWeek'] = plane_data['EventDate'].dt.dayofweek

# encode delay category
le = LabelEncoder()
plane_data['DelayClass'] = le.fit_transform(plane_data['DelayCategory'])

# train on data from 2015–2021
train_df = plane_data[(plane_data['EventYear'] >= 2015) & (plane_data['EventYear'] <= 2021)]
predict_df = plane_data[plane_data['EventYear'] >= 2022]

feature_cols = ['EventYear', 'EventMonth', 'EventDayOfWeek']
X_train = train_df[feature_cols]
y_train = train_df['DelayClass']

# train the model using HistGradientBoostingClassifier
model = HistGradientBoostingClassifier(max_iter=100, learning_rate=0.1)
model.fit(X_train, y_train)

# prepare features for prediction data (2023 onwards)
X_predict = predict_df[feature_cols]

# predict classes for the future data
predicted_classes = model.predict(X_predict)
predict_df['PredictedClass'] = predicted_classes
predict_df['PredictedDelayCategory'] = le.inverse_transform(predicted_classes)

# evaluate the model on known data (2022–2023)
eval_df = predict_df[predict_df['DelayCategory'] != 'Not Filed'] 
if not eval_df.empty:
    # encode the true labels using LabelEncoder
    y_true = le.transform(eval_df['DelayCategory'])
    y_pred = eval_df['PredictedClass']
    
    # get the target class names excluding 'Not Filed'
    target_classes = le.classes_
    target_classes = target_classes[target_classes != 'Not Filed']
    
    # print classification report for evaluation
    print("\nClassification Report (2022–2023 known reports):")
    print(classification_report(y_true, y_pred, target_names=target_classes))

    # confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(10, 6))
    sns.heatmap(cm, annot=True, fmt='d', xticklabels=target_classes, yticklabels=target_classes, cmap='Blues')
    plt.xlabel("Predicted Delays in Publication")
    plt.ylabel("Actual Delays in Publication")
    plt.title("Figure 7.1: Confusion Matrix for Delays in Publication (2022–2023)")
    plt.show()

# export predictions for future data if needed
# predict_df.to_csv('predicted_report_delays_2022_onward.csv', index=False)
C:\Users\yesui\AppData\Local\Temp\ipykernel_21156\2325903038.py:41: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  predict_df['PredictedClass'] = predicted_classes
C:\Users\yesui\AppData\Local\Temp\ipykernel_21156\2325903038.py:42: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  predict_df['PredictedDelayCategory'] = le.inverse_transform(predicted_classes)
Classification Report (2022–2023 known reports):
              precision    recall  f1-score   support

       1-120       0.28      0.10      0.14       959
     121-240       0.24      0.70      0.36       717
     241-360       0.17      0.07      0.10       261
     361-480       0.09      0.02      0.03       214
     481-600       0.12      0.06      0.08       317
     601-720       0.04      0.03      0.03       194
     721-840       0.00      0.00      0.00       157
        840+       0.00      0.00      0.00         2

    accuracy                           0.23      2821
   macro avg       0.12      0.12      0.09      2821
weighted avg       0.20      0.23      0.16      2821

No description has been provided for this image
In [ ]:
plane_data = pd.read_csv('Merged10yrdata.csv', low_memory=False)

# filter for only US data
plane_data = plane_data[plane_data['Country'] == 'United States']

# convert dates to datetime
plane_data['EventDate'] = pd.to_datetime(plane_data['EventDate'])
plane_data['OriginalPublishedDate'] = pd.to_datetime(plane_data['OriginalPublishedDate'])

# calculate publishing delay in days
plane_data['DelayDays'] = (plane_data['OriginalPublishedDate'] - plane_data['EventDate']).dt.days
plane_data['EventMonth'] = plane_data['EventDate'].dt.month
plane_data['EventDayOfWeek'] = plane_data['EventDate'].dt.dayofweek
plane_data['EventYear'] = plane_data['EventDate'].dt.year

#plane_data.head()
In [ ]:
# fill in not filed reports using hot deck imputation
def hot_deck_impute(df):
    df = df.copy()
    group_median = df.groupby(['EventMonth', 'EventDayOfWeek'])['DelayDays'].median()

    def impute(row):
        if pd.isna(row['DelayDays']):
            return group_median.get((row['EventMonth'], row['EventDayOfWeek']), df['DelayDays'].median())
        return row['DelayDays']

    df['DelayDays'] = df.apply(impute, axis=1)
    return df

plane_data = hot_deck_impute(plane_data)
In [ ]:
features = ['EventMonth', 'EventDayOfWeek']
X = plane_data[features]
y = plane_data['DelayDays']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train HistGradientBoostingRegressor
model = HistGradientBoostingRegressor(max_iter=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

# evaluate model
y_pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R^2:", r2_score(y_test, y_pred))
MAE: 259.60606489898737
RMSE: 326.4940696438964
R^2: 0.0021508599912763993
In [ ]:
future_years = range(2025, 2036)
future_events = []

for year in future_years:
    for month in range(1, 13):
        # generate 5 events per month (random day)
        for _ in range(5):
            fake_date = pd.Timestamp(f'{year}-{month:02d}-{np.random.randint(1, 28)}')
            future_events.append({
                'EventDate': fake_date,
                'EventMonth': fake_date.month,
                'EventDayOfWeek': fake_date.dayofweek,
            })

synthetic_future_df = pd.DataFrame(future_events)

# feature extraction
def extract_features(df):
    df = df.copy()
    df['EventMonth'] = df['EventDate'].dt.month
    df['EventDayOfWeek'] = df['EventDate'].dt.dayofweek
    return df[['EventMonth', 'EventDayOfWeek']] 

X_future = extract_features(synthetic_future_df)

# predict using trained model
predicted_delays = model.predict(X_future)

# add predictions to the DataFrame
synthetic_future_df['PredictedDelayDays'] = predicted_delays
synthetic_future_df['Year'] = synthetic_future_df['EventDate'].dt.year
synthetic_future_df['Month'] = synthetic_future_df['EventDate'].dt.month

# group by year and calculate the mean delay
mean_delays_per_year = synthetic_future_df.groupby('Year')['PredictedDelayDays'].mean()

#mean_delays_per_year
In [ ]:
# plot the mean predicted delays per year
plt.figure(figsize=(10, 6))
plt.plot(mean_delays_per_year.index, mean_delays_per_year.values, marker='o', linestyle='-', color='b')
plt.title('Figure 7.2: Predicted Average Publishing Delay (2025–2035)', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average Publishing Delay (days)', fontsize=12)
plt.grid(True)
plt.xticks(mean_delays_per_year.index, rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

Nav's ML¶

For my ML analysis, I wanted to understand whether we could sense correlation/causation between cateogires to see how they impact the severity of an airline accident. It's easy to imagine that based on weather patterns in certain parts of the US, the likelihood that a crash there will be more catastrophic. To do this, I chose to do a simple Decision Tree. I filtered for:

  • Make: The aircraft's manufactures (filtered for aircraft)
  • Longitude and Latitude: The locations of the aircraft (filtered by country to limit accidents to those in the USA)
  • Weather Condition: The severity of the weathter at the time of the accident

I then aimed to predict the AirCraftDamage column, which classified damage as either minor or substatntial.

In [11]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
import numpy as np

Choosing the Columns to Focus on¶

  • I'll be choosing the following columns as predictors: Make, Longitude, Latitude, and WeatherCondition
  • I'll be choosing the following column to predict: AircraftDamage
In [13]:
df = pd.read_csv("Merged10yrdata.csv", low_memory=False)
df.columns = df.columns.str.strip()  

df = df[df['Country'] == 'United States']

df = df[['Make', 'Longitude', 'Latitude', 'WeatherCondition', 'AirCraftDamage']]

df = df[df['Make'].isin([
    'BOEING', 'Airbus', 'Boeing', 'AIRBUS', 
    'AIRBUS INDUSTRIE', 'BOEING, AIRBUS', 'AIRBUS SAS'
])]
df
Out[13]:
Make Longitude Latitude WeatherCondition AirCraftDamage
129 BOEING -153.843048 22.608888 IMC NaN
218 BOEING -78.502502 41.412498 VMC Substantial
223 AIRBUS -150.870834 61.228610 VMC Substantial
333 BOEING -117.383056 34.597499 VMC Minor
429 AIRBUS -114.125550 46.251667 VMC Substantial
... ... ... ... ... ...
16256 BOEING NaN NaN NaN NaN
16265 BOEING NaN NaN NaN NaN
16266 BOEING NaN NaN NaN NaN
16362 BOEING NaN NaN NaN NaN
16400 BOEING NaN NaN VMC NaN

325 rows × 5 columns

Data Prepping, Clearing, and Filtering¶

In [15]:
df['AirCraftDamage'].unique()
Out[15]:
array([nan, 'Substantial', 'Minor', 'Destroyed', 'Minor, Substantial',
       'Substantial, Substantial', 'None, None', 'Unknown'], dtype=object)
In [17]:
print(df['AirCraftDamage'].value_counts())
AirCraftDamage
Substantial                 129
Minor                        32
Destroyed                     5
Minor, Substantial            1
Substantial, Substantial      1
None, None                    1
Unknown                       1
Name: count, dtype: int64
In [19]:
df = df[df["AirCraftDamage"].isin(
    ['Substantial', 'Minor', 'Destroyed']
)] # Removing unknown and nones

df = df[df["WeatherCondition"].isin(['VMC', 'IMC'])] # Removing Unknown and None

df
Out[19]:
Make Longitude Latitude WeatherCondition AirCraftDamage
218 BOEING -78.502502 41.412498 VMC Substantial
223 AIRBUS -150.870834 61.228610 VMC Substantial
333 BOEING -117.383056 34.597499 VMC Minor
429 AIRBUS -114.125550 46.251667 VMC Substantial
434 AIRBUS -90.259162 29.993333 VMC Minor
... ... ... ... ... ...
15920 BOEING -97.243592 33.370991 VMC Substantial
16003 BOEING -97.199097 33.201961 VMC Substantial
16004 BOEING -76.718454 37.239960 VMC Substantial
16111 BOEING -82.410832 38.434006 VMC Substantial
16169 BOEING -116.972440 32.826222 VMC Substantial

148 rows × 5 columns

ML Analysis - Decision Tree¶

Now that the data has been cleaned, we run an analysis

In [21]:
from sklearn import tree
from sklearn.model_selection import train_test_split

my_tree = tree.DecisionTreeClassifier()

X = df[['Make', 'Longitude', 'Latitude', 'WeatherCondition']]
y = df['AirCraftDamage']

X = pd.get_dummies(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

my_tree.fit(X_train, y_train)
Out[21]:
DecisionTreeClassifier()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier()
In [ ]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))

plot_tree(
    my_tree,
    feature_names=X.columns,           
    class_names=my_tree.classes_,      
    filled=True,                       
    rounded=True                       
)

plt.show()
No description has been provided for this image
In [ ]:
from sklearn.metrics import classification_report

# Predict only on the test set
y_pred = my_tree.predict(X_test)

# Now compare the same-sized arrays
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

   Destroyed       0.00      0.00      0.00         1
       Minor       0.00      0.00      0.00         5
 Substantial       0.82      0.90      0.86        31

    accuracy                           0.76        37
   macro avg       0.27      0.30      0.29        37
weighted avg       0.69      0.76      0.72        37

Reflection¶

The most challenging part of our project that we have encountered so far is understanding what we can actually do with all the data we have. It is excellent that there is so much data regarding the incidents, however we quickly learned that understanding the factors behind these incidents is not black and white. We also didn’t consider the aspect of the flights that occur on a day-to-day basis that do not result in incidents and we learned that gathering that data would be a far greater task than we had originally anticipated.

We believe we are generally on track with our project, especially in terms of collecting and analyzing data related to flight incidents. Most of our work so far has focused on investigating patterns in aviation accidents using datasets from the NTSB and FAA. However, we’ve realized that we haven't spent as much time exploring the media attention side of our research question.

To fully address whether the rise in concern is due to actual increases in incidents or increased scrutiny, we need to dedicate more time to analyzing trends in media coverage. This could include looking at the frequency and tone of news articles, social media posts, and public reactions over time. Integrating this data will help us better understand the relationship between real-world events and public perception.

However, this proves to be one of the biggest challenges we are currently facing is compiling and analyzing data related to the frequency and tone of news articles and public reactions. Unlike aviation incident data, which is already structured and readily available, media coverage is more difficult to quantify. Tracking how often incidents are reported, identifying which stories gain traction, and assessing the tone or framing of articles would require sifting through extremely large amounts of text. Especially, in today’s age where almost everyone is receiving their information through social media such as TikTok which has millions of videos being posted daily, the sheer volume of content makes this difficult. And with the heavy impact of AI and fear mongering on social media, we are unable to determine how much of what we are seeing is real or not regarding what we are actually seeing.

Going forward, we need to determine if there is a manageable way to incorporate media data, or if our research question and scope needs to be modified. We would also like to further our understanding by incorporating further statistical analyses and better machine learning analyses.

The hypothesis “Reports of US aviation incidents are being published more quickly in recent years, indicating increased scrutiny” (using the data points of report publication and event date) was rejected by both the visualization and machine learning model. We will need to find other variables in our dataset that might correlate to our main idea.